Cofea: correlation-based feature selection for single-cell chromatin accessibility data

Abstract Single-cell chromatin accessibility sequencing (scCAS) technologies have enabled characterizing the epigenomic heterogeneity of individual cells. However, the identification of features of scCAS data that are relevant to underlying biological processes remains a significant gap. Here, we introduce a novel method Cofea, to fill this gap. Through comprehensive experiments on 5 simulated and 54 real datasets, Cofea demonstrates its superiority in capturing cellular heterogeneity and facilitating downstream analysis. Applying this method to identification of cell type-specific peaks and candidate enhancers, as well as pathway enrichment analysis and partitioned heritability analysis, we illustrate the potential of Cofea to uncover functional biological process.


Supplementary Notes Supplementary Note 1: Detailed descriptions of the three baseline methods
Selecting peaks with highest degree of accessibility (HDA) is the simplest and most widely used feature selection method in scCAS data analysis 1,2 .The HDA algorithm aggregates the read counts of each feature across all single-cells, and ranks the features based on this cumulative count.The features with the highest degree of accessibility are selected as the "informative features".epiScanpy 3 is one of the most commonly used toolkits for scCAS data analysis in Python language, and feature selection is a step in its preprocessing pipeline.epiScanpy considers the most variable features to open only in half of the cells, and the least variable features to open in none or all of the cells.To this end, epiScanpy defines the variability score (VS) of a feature as the following formula: where   denotes the element in the raw peak-by-cell count matrix  ∈ ℝ × .The range of variability scores extends 0.5 to 1.A score of 0.5 denotes peaks that are consistently accessible or inaccessible across all cells, while a score of 1 indicates peaks that are open in approximately half of the cells and closed in the other half.Consequently, epiScanpy attributes significance to peaks with variability scores closest to 1, considering them to encapsulate the most informative patterns and epiScanpy selects a user-defined number of features with the highest variability score as the "informative features".Signac 4 is one of the most widely used toolkit for scCAS data analysis in R language.In its pipeline, feature selection is performed after peak calling and cell filtering.Signac performs TF-IDF transformation on the peak-by-cell matrix, followed by computing the quantile of the sum of each row in the transformed matrix.The quantile indicates the importance score of each feature, and features with higher quantiles are considered as the "informative features".

Supplementary Note 2: The number of selected features
In our study, in order to provide a fair and quantitative comparison among different feature selection methods, we set an equal number (5,000, 10,000, 15,000, 20,000, and 25,000) of peaks selected by different methods on each dataset.
We also considered two other feature selection strategies for benchmarking: 1) Similar to Signac, fix the percentage of peaks selected for each dataset; 2) first select peaks that are accessible in no less than X% (X denotes a number set by users) of cells (as with the HDA method), and then compare Cofea, Signac and epiScanpy using the same number of selected peaks.However, these strategies have limitations: 1) the number and the accessibility degree of peaks vary across datasets, making it challenging to quantitatively benchmark different feature selection methods among various datasets; 2) it is not conducive to detect how many peaks each method needs to select to reproduce cell-type annotation results with all peaks.
The baseline methods we employed, namely epiScanpy and Signac, both provide a default number of selected features within their source code.Specifically, for epiScanpy, the feature selection function is named as "select_var_features", which calculates variability scores to rank and select peaks (details in Supplementary Note 1).As indicated in the documentation of the 'select_var_feature' function, there are two parameters to determine the number of selected peaks: 'min_score' and 'nb_feature'.The 'min_score' sets the minimum threshold of variability scores necessary to retain features, while 'nb_feature' specifies the exact number of features to be selected.Therefore, the default values for these parameters are '0.5' and 'None', respectively, which signifies that by default, the 'select_var_feature' function retains all available features.For Signac, its feature selection function ('FindTopFeatures') utilizes a quantile parameter ('min.cutoff') to control the number of selected peaks (details in Supplementary Note 1).The default value for 'min.cutoff' is 'q5', implying that the function defaults to selecting 95% of peaks.We had conducted several preliminary experiments and found that only a minimal disparity existing in the quantities between selecting 95% of peaks and retaining the entirety of peaks, leading to similar results on downstream analyses.Utilizing the default parameter settings for feature selection from epiScanpy or Signac would lead to the compromise of comparative significance.
Certainly, both of the aforementioned workflows also provide tutorials for analyzing scCAS data.Within these tutorials, both workflows tend to select the majority of peaks for downstream clustering analysis.However, as mentioned previously, opting for the majority of peaks would lead to comparable performance among different feature selection methods, making it challenging for benchmark.Therefore, we do not choose to compare Cofea with the default (or frequently used) number for feature selection in Signac or epiScanpy with baseline methods.

Supplementary Note 3: UMAP visualization on Luecken2021 dataset
To validate the performance of Cofea on datasets having a large number of cell types, we conducted a visualized analysis on Luecken2021 dataset.The Luecken2021 dataset comprises a total of 69,249 cells across 22 distinct cell types, making it a multi-cell type dataset.Applying Cofea and baseline methods to this dataset, we selected 20,000 peaks, and performed dimensionality reduction on these resulting matrices, respectively.We then performed UMAP projection and visualized all cells of the dataset in a 2-dimensional space.As shown in Supplementary Figure 1, compared to baseline methods, biological variations between cell types are better distinguished when using peaks selected by Cofea.Notably, Cofea also achieved superior performance in uncovering few-sample cell types.
For example, both MK/E prog cells (a total of 884 cells, highlighted in red rectangles) and cDC2 cells (a total of 859 cells, highlighted in green rectangles), two cell types with less than 1,000 cells, are distinctly separated from other cell types using features selected by Cofea.Conversely, other methods intermix these two cell types with others.These results demonstrate that Cofea can effectively preserve cell heterogeneity and uncover few-sample cell-types.

Supplementary Note 4: Discussion of the minor performance disparities among the three baseline methods
During the processes of dimensionality reduction and cell clustering, it is evident that the performances of three baseline methods are very similar to each other, both across distinct datasets and various numbers of selected peaks.Such results can be attributed to the underlying principles of three baseline methods.HDA selects features with the highest degree of accessibility based on the raw count peak-by-cell matrix.epiScanpy tends to select features that are open in approximately half of the cells.However, due to the fact that in the majority of datasets, only a very limited number of features are open in over half of the cells, epiScanpy's results typically align with those of HDA.
Signac first performed TF-IDF transformation on the peak-by-cell matrix, and then selected features based on the sum of each row in the transformed matrix.In other words, Signac also prioritizes the selection of features based on the degree of accessibility, but it incorporates normalization for the varying accessibility values of different features across cells.The fundamental essence of three baseline methods lies in selecting features with the highest degree of accessibility.This is precisely why their computational outcomes exhibit similarities across the majority of datasets.

Supplementary Note 5: Performance precision examination
As demonstrated in the two recent benchmarking papers 5,6 , both of them performed cell clustering on scCAS data, while achieved higher metrics (such as ARI and ASW) than that in our experiment upon the same dataset.We have conducted an investigation into this discrepancy and found that it is primarily arise from variations in the datasets and the differences in the experimental settings of data processing.
Despite both benchmark papers using the Buenrostro2018 dataset, there are inconsistencies in the dataset details and the processing procedures applied compared to ours.The Buenrostro2018 dataset we used for benchmarking was downloaded from GSE96772, which comprises 2034 cells from 13 cell types.We filtered the raw data based on read counts, selecting peaks that were accessible in more than 1% of the cells.From the TF-IDF transformed matrix, we extracted the peaks selected by Cofea and baseline methods (or retained all peaks when no feature selection was performed), and conducted PCA transformation for dimensionality reduction.Subsequently, we applied Louvain clustering, and determined an appropriate resolution to ensure the number of clusters matches the number of labels through a binary search approach.Luo et al.'s study used the Buenrostro2018 dataset consisting of 1711 cells across 9 cell types.They employed the Leiden algorithm for cell clustering and adjusted resolutions to generate different numbers of clusters.While we did not find a download link for the processed dataset in their paper or GitHub repository, it appears that the cell clustering task have been simplified by label merging or data preprocessing, which could contribute to achieving higher performance.The Buenrostro2018 dataset utilized by Chen et al. comprised 2034 cells across 10 cell types.Upon inspecting the dataset provided in the paper, we noticed that they merged the cell types 'GMP3high', 'GMP2mid', and 'GMP1low' from the original dataset into the cell type 'GMP', which also simplified the cell clustering task to a certain extent.Chen et al. evaluated the performance of various data processing methods, among which we focused on the TF-IDF+PCA method (referred to as Cusanovich2018 in the benchmarking paper, but renamed here to avoid confusion with dataset names).This method is similar to Signac, as both involve TF-IDF and PCA transformation on the input matrix.
Following the data process approach and experimental settings outlined in benchmarking paper of Chen et al., we aggregated four cell types from the Buenrostro 2018 dataset ('GMP3high', 'GMPmid', 'GMP1low', and 'GMP') and then applied Louvain clustering based on the binary search method to calculate ARI score.We maintained consistency with the experimental settings outlined by Chen et al. in their benchmarking study, which involves retaining 150 principal components through PCA transformation and setting the size of the local neighborhood to 15 for Louvain clustering.While the obtained ARI value of 0.440 still trailed behind TF-IDF+PCA (referred to as Cusanovich2018 in the paper) which could be attributed to the data preprocessing performed by TF-IDF+PCA, it outperformed 13 out of the 17 methods tested in the benchmarking, such as Cicero+PCA and SCRAT+PCA.This indicates that the ARI and other metrics obtained in our experiments are within a reasonable range.
For atlas HGCA, the imprecision in manual cell type annotation resulted in an excessive number of cell types within the datasets.We did not conduct preprocessing steps such as cell filtering, which could lead to the inferior clustering performance than the results in the benchmarking papers.In summary, the low ARI scores can be attributed to various factors, including differences in the dataset itself, preprocessing steps, and experimental settings.The purpose of calculating baseline performances without feature selection was to observe whether the feature selection methods can facilitate downstream analysis, thereby evaluating their performance.

Supplementary Note 6: Comparative analysis of selected peaks and known chromatin regions
We employed both Cofea and baseline methods to identify 5000 peaks from the MCA brain dataset and subsequently compared these peaks with candidate cis-regulatory elements individually.As shown in Supplementary Figure 6A in our manuscript, the peaks identified by Cofea comprised a greater number of enhancers and a reduced count of promoters than baseline methods.Both Cofea and the baseline methods identified relatively few CTCF-only and DNase-H3K4me3 regions.
We have noticed that a limitation in Cofea stemming from the relatively low count of PLS regions encompassed within the identified peaks.As clarified in Supplementary Note 14, we conducted a comparison of the identified peaks and promoter regions of housekeeping (HK) genes.As shown in Supplementary Figure 6B, the substitution of overlapped proportion between the selected peaks and TSS did not significantly change because of the substitution of PLS.Upon comparing the identified peaks from baseline methods with the promoter regions of HK genes, HDA, epiScanpy, and Signac overlapped 1532 (30.64%), 1532 (30.64%), and 1592 (31.84%) with these regions among the 5000 peaks they selected, respectively.In contrast, Cofea identified only one peak overlapping with the promoters of HK genes.Considering HK genes are required for fundamental cellular processes, they express in the vast majority of cell types and lack cell type-specific characteristics.In fact, the number of genes associated with specific cell types is significantly lower than that of non-HK genes, often comprising only a minority.Therefore, a higher count of promoters within the selected peaks does not imply that they encompass a greater degree of cellular heterogeneity information or a superior ability to discriminate cell types.Subsequently, we followed the analysis procedure outlined by Anderson, A. G. et al., and conducted a comparative analysis of the overlap distribution of selected peaks with all the candidate cis-regulatory elements.After removing duplicate overlapping samples, we constructed and presented the distribution of selected peaks overlapping with candidate cis-regulatory elements (Supplementary Figure 6C).The vast majority (more than 95%) of the peaks selected by baseline methods accurately captured known cis-regulatory elements.While among the peaks selected by Cofea, the count of the 'other' region (the number of peaks that do not overlap with any of the five elements) was 2627, constituting 52.54% of the total peaks.The disparity between the results of Cofea and baseline methods arises from their distinct priorities on retaining information.Most of these elements are derived from bulk sequencing data.Meanwhile, as we mentioned in the Section 'Introduction' of our manuscript, peaks selected by baseline methods tend to prioritize information from cell types with a substantial number of cells.Both of the elements and the selected peaks from baseline methods capture similar information of the major cell type, resulting in a higher overlapped proportion.Cofea, on the other hand, prioritize to capture information related to different cell types.Furthermore, we conducted an analysis on the peaks identified by Cofea that do not overlap with candidate cis-regulatory elements downloaded from the SCREEN database.Utilizing the 2627 peaks within 'other' regions of Cofea, we performed dimensionality reduction to 10 with PCA and Louvain clustering with a binary search to ensure the number of clusters matches the number of cell types.The cell clustering yielded NMI, ARI, Homo, and AMI scores of 0.512, 0.324, 0.510, and 0.511, respectively.Cofea outperformed the baseline methods across all the clustering metrics (NMI, ARI, Homo and AMI).As shown in Supplementary Figure 7A which visualized the clustering results via UMAP, Cofea demonstrated a superior ability to differentiate various cell types within the MCA brain compared to the baseline methods.We compared the 2627 peaks identified as 'other' from Cofea agayj inst all of the selected 5000 peaks from baseline methods to mitigate this potential bias.In this comparison, baseline methods gained a potential advantage by leveraging a larger set of peaks for dimensionality reduction and cell clustering.However, as shown in Supplementary Figure 7B, the clustering metrics of Cofea remains superior to that achieved by baseline methods.
Taking the NMI as an example, when utilizing the 2627 peaks identified as 'others' by Cofea for Louvain clustering, the NMI value was 0.511.In contrast, the use of 5000 peaks selected by baseline methods consistently yielded metrics below 0.4.This underscores Cofea's capacity to extract information that is relevant to different cell types.These results indicate that the peaks selected by Cofea contain more information related to cellular heterogeneity, even though they do not overlap entirely with known cis-regulatory elements.The distribution of selected peaks within known cisregulatory elements provides us comprehensive information on the type of informative peaks on real datasets.However, based on our experiments, peaks that enriched with cellular heterogeneity may not necessarily have a high degree of overlap with candidate cis-regulatory elements.

Supplementary Note 7: Extended dropout experiments
To verify whether Cofea is able to adapt to the technical noise, we downsampled the reads on the Buenrostro2018 dataset with various dropout rates.More specifically, given a dropout rate, we randomly set the nonzero elements in the peak-by-cell count matrix to zero.We compared the overlapped proportion of selected features on raw data (referred as to raw-set features) with features selected on noised data (referred as to noised-set features).Interestingly, we observed a trend when incorporating more noise with count matrix, features selected by Cofea have a smaller overlap with raw-set features, while baseline methods basically remain the original selection (Supplementary Figure 9A).To validate if such the change of Cofea is beneficial for downstream analysis, we performed dimensionality reduction and cell clustering on the noised data with raw-set features and noised-set features, respectively.As shown in Supplementary Figure 9B, Cofea with noised-set features achieved a relatively better performance, suggesting that the feature selected by Cofea are adaptable to technical noise.
where   denotes the element in the peak-by-cell matrix  ∈ ℝ × , and   ′′ is the corresponding element in the newly formed matrix  ′′ ∈ ℝ × after TF-IDF_original transformation.
The TF-IDF_scOpen transformation uses a characteristic function I(  > 0) to represent whether peak  is open or closed in cell : The TF-IDF_scOpen transformation can be calculated based on this representation: where   ′′′ is the corresponding element in the newly formed matrix  ′′′ ∈ ℝ × after TF-IDF_scOpen transformation.We utilized the implementation presented in the scOpen source code, specifically the TfidfTransformer model from scikit-learn package with the same parameter settings as in scOpen.

Supplementary Note 9: Computational efficiency experiments
We assessed the computational efficiency and scalability of Cofea using the HGCA esophagus dataset, which contains 82,469 cells and 101,251 peaks after preprocessing.We randomly downsampled cells or peaks to obtain a variety of newly-formed datasets with different sizes, and then benchmarked the running time and peak memory usage of different feature selection methods.
Note that since Cofea and baseline methods evaluates and ranks all the peaks regardless of how many features users require, the number of selected features has no effect on the evaluation.As shown in Supplementary Figure 10A, the running time and memory use of Cofea increase as the number of cells or peaks increases.More specifically, when the dataset contains more than 50,000 cells, running time required for the stepwise preprocessing takes up the majority of the total running time and was proportional to the size of the dataset.The running time taken for correlation calculation and fitting steps, on the other hand, is hardly affected by the number of cells due to the implementation of cell-wise PCA steps (Supplementary Figure 10B).We also compared Cofea with other baseline methods on the complete HGCA heart dataset, and it is intuitive that, because of the need to obtaining inter-peak correlation in Cofea, the computational expense is larger than methods based on simplistic statistics of individual peaks (Supplementary Figure 10C).In addition, we conducted runtime assessments of the Cofea and baseline methods across all datasets, with the results presented in Supplementary Table 1.
We recorded the time consumed by the steps of data analysis on datasets of different scales.As examples, we used the datasets of FCA intestine, FCA muscle and HGCA artery, which represent increasing numbers of peaks.Specifically, the FCA intestine dataset contains 42942 cells and 64647 peaks, the FCA muscle dataset comprises 27181 cells and 96720 peaks, and the HGCA artery dataset includes 46283 cells and 158344 peaks.Without performing feature selection, the entire analysis process took 702.06s,559.99s, and 1059.20s on the three datasets, respectively.When using Cofea for feature selection, the total processing times were 724.42s, 778.34s, and 1589.12s,respectively.
Notably, the feature selection step accounted for a significant portion of the total time, specifically

Supplementary Note 10: Tailored parallel arithmetic strategy
As Cofea is a correlation-based feature selection method, obtaining correlation coefficients between peaks is an essential step in its procedure.However, as the number of peaks typically reaches 100,000, the peak-by-peak correlation matrix is exceedingly high dimensional, and the elements within it are continuous and dense.This will raise issues such as memory crush or intolerable running time.Moreover, as mentioned in the framework of Cofea, only the mean value and mean square value of inter-peak correlation coefficients are required, rather than the whole peak-by-peak correlation matrix.If the element   in  is computed iteratively, that is, for each peak, an inner loop is executed to obtain Pearson correlation coefficients (PCC) between that peak and the other peaks, and then store the mean value and mean square value.This process is repeated  times, using an outer loop to iterate over each peak, which will take up plenty of time, causing low computing efficiency.Therefore, we tailored a parallel arithmetic strategy to obtain the correlation between peaks, computing PCC between  (specified by the user) peaks and all peaks at one time.
More specifically, we used matrix operations to accelerate the computation.We took out a submatrix from the PCA-transformed matrix , referred to as  () ∈ ℝ × , containing the th to ( + )th peaks and all cells.Then we extended the formula of PCC into matrix operation and calculated the peak-peak correlation matrix   ∈ ℝ × , which contains elements from  1 to  (+) in .
We first calculated a vector  ∈ ℝ  using the formula:

𝑗
where   is an element of , representing the variance of the accessibility of peak  across all cells.A matrix  ∈ ℝ × is generated from  by vertically replicating  times.Next, we calculate the correlation matrix as: where  ̅ ∈ ℝ × and  ̅ () ∈ ℝ × are matrices that element denotes the mean value of the corresponding row in  and  (𝑖) , therefore the elements in the same row share a consistent value.
Diag() denotes a function that keeps only the diagonal elements of  and sets other elements to 0.
cell type B (750 cells) and cell type C (750 cells).For each cell type, cell type-specific peaks account for 5% of all peaks.For background peaks, we set the degree of accessibility ranging from 0% to 5% with a uniform distribution across all cells.For cell type-specific peaks, we set the degree of accessibility ranging from 5% to 10% in one cell type and from 1% to 3% in the other cell types.
S4 contains 100,000 peaks and 4,000 cells, including three common cell types (cell type A, B and C) with 1,200 cells within each cell type, and a rare cell type D with 400 cells.For each cell type, cell type-specific peaks account for 5% of all peaks.For background peaks, we set the degree of accessibility ranging from 0% to 5% with a uniformly distribution across all cells.The degree of accessibility for cell type-specific peaks and background peaks were set the same as S3.
In comparison to dataset S1 to S4 with discrete cell populations, dataset S5 is constituted by 750 cells with a continuous differentiation trajectory, that is, 250 cells of cell type A differentiating into 250 cells of cell type B and 250 cells of cell type C. Using S5 for illustration, we tested whether features identified by Cofea could be effectively implemented in trajectory inference.As a side note, here the cell type-specific peaks are no longer more accessible in a tailored cell type, but relevant to cell differentiation, which is called differentiation-specific peaks.We set the number of peaks to 100,000 and treated 10% of all peaks as differentiation-specific peaks.To reflect cell differentiation from the accessibility of peaks, we generate the value of th peak and th cell from a Bernoulli distribution, of which the probability parameter   follows a Brownian motion.More specifically, for each peak, the increment  between  th cell and  + 1 th cell is sampled from a Gaussian distribution (0,0.0001),that is  (+1) =   + , and for the root cell,  1 are fixed to 0.05.
Note that to generate high-fidelity data, the parameter   is restricted to a range of [0,0.2].The accessibility of background peaks is expected to be consistent across different cell types.To differentiate between differentiation-specific peaks and background peaks, we randomly shuffled the accessibility of each background peak in cells while maintaining the range of 0 and 0.2.
using Louvain clustering algorithm based on the peak-by-PC matrix generated by the intermediate step of Cofea.To gain functional insights into these features, we performed pathway enrichment on each cluster of features using GREAT analysis 10 , and compared the enriched pathways with cellular functions of cells in a scCAS dataset, to detect a deeper understanding of the biological functions encoded within these features.
Fourth, we analysed the heritability of five brain-related phenotypes within specific partial peaks using partitioned linkage disequilibrium score regression (S-LDSC) 11,12 .To achieve this objective, we first mapped the genome coordinates of the peaks identified by Cofea to GRCh37/hg19.S-LDSC was then applied to each phenotype to estimate the enrichment of heritability within the identified peaks.To perform the analysis, we followed the recommended LDSC workflow, using HapMap3 SNPs and European samples from the 1000 Genomes Project as the LD reference panel.The GWAS summary statistics were downloaded from https://data.broadinstitute.org/alkesgroup/sumstats_formatted/.
Furthermore, we conducted an analysis on the peaks identified by Cofea that do not overlap with candidate cis-regulatory elements downloaded from the SCREEN database.Utilizing peaks within 'other' regions of Cofea, we performed dimensionality reduction to 10 with PCA and Louvain clustering with a binary search to ensure the number of clusters matches the number of cell types.
To address the potential bias, we compared the 2627 peaks identified as 'others' from Cofea against all of the selected 5000 peaks from baseline methods.Furthermore, we employed UMAP for visualizing the clustering results.

Supplementary Note 15: Metrics for assessment of dimensionality reduction and cell clustering
We denote  as the ground-truth cell labels,  as the predicted cluster labels, and the NMI score can be calculated by the following formula: where E(•) is the expectation function.We then suppose that   is the th cell label,   is the  th cluster label,   is the number of cells simultaneously belonging to   and   ,  • is the number of cells that belong to   ,  • is the number of cells that belong to   , and  is the total number of cells.The ARI score is calculated as follows: The Homo score can be calculated by the following formula: where H(|) is the entropy of the ground-truth cell labels based on the predicted cluster labels.
For all the four metrics, a higher score closed to 1 indicates that the clustering results are more consistent with the ground-truth labels.
The silhouette width (SW) measures the relationship between the distances of a cell from other cells in the same cell type (referred to as inner-cell-type distances) and the distances of the cell from cells in the closest cell type (referred to as inter-cell-type distances).The SW score for the th cell is computed as follows: where   is the average of inner-cell-type distances and   is the average of inter-cell-type distances for cell type .ASW is the average of all the silhouette widths of a set of cells and for the bio-conservation score, we scaled the ASW to a value between 0 and 1 as suggested in scIB 14 , assessing whether the clusters are dense and well-separated.ASW can be calculated as: where  is the total number of cell types in the dataset.
cLISI is used to measure the accuracy of an embedding with cell-type prediction.Specifically, after mixing the cells, cLISI predict type for each cell based on its 30 nearest neighbors in lowdimensional space, with the weighted sum of their cell types to find out the most likely prediction.
A cLISI of 1 reflects successful separation of cell types, that is, different cell types group separately instead of together.
data.In summary, feature selection methods specifically designed for scRNA-seq data have evident errors and limitations when applied to scCAS data, and consequently, we did not continue to serve these methods as baseline method for comprehensive benchmarking.
431.93s, 650.76s, and 1064.07s,while downstream analysis occupied a relatively smaller portion of the total time, approximately 292.49s, 127.58s, and 525.05s.While feature selection may slightly extend the overall running time, it is an indispensable strategy in scCAS data analysis.Adjustments to downstream analyses, such as clustering, are often required during the analysis process, and the efficiency advantages of feature selection become remarkble in such iterative analyses.Moreover, in terms of performance, conducting feature selection consistently yields superior results, as demonstrated in Figure3Aand 3D of our manuscript.Dimensionality reduction and cell clustering were the benchmarks we employed to evaluate feature selection methods, hence our selection of the most efficient and commonly used methods (TF-IDF and PCA transformation) in our paper.However, in practical applications, there is a growing trend towards the use of deep learning approaches to achieve more accurate cellular representations.These methods typically involve forward and backward processes, leading to slower convergence rates compared to the conventional methods.To address this, we applied a deep learning model, SCALE, as a replacement for TF-IDF and PCA on the data and recorded the processing time with and without feature selection.On the three datasets, running times of the entire data processing pipeline without feature selection were 2296.90s,1784.20s, and 2339.93s.In contrast, the complete data processing pipeline with feature selection included took the following times: 1170.37s,1327.70s, and 1874.54s.After performing feature selection, the GPU utilization during training on a GeForce RTX 2080 GPU decreased from an average of 92% to 48%.This reduction allowed for an increase in the batch size from 64 to 128, resulting in faster model convergence and reducing the overall data analysis time to 1145.69s, 1301.19s, and 1849.24s on the three datasets, respectively.Through the extraction of features rich in cellular heterogeneity, Cofea reduces the dimensionality of features, thereby enhancing the efficiency of the data analysis process, particularly for deep learning models.

4 Supplementary Figure 4 .
, respectively, on datasets in FCA atlas.For each feature selection method, we set 5000, 10000, 15000, 20000 and 25000 as the number of selected features.The measure of center for the error bars denotes the median value of the metric on different datasets in an atlas, and the error bar denotes the maximum and minimum value after removing the outliers, which are defined as the values whose distance from the median is greater than 1.5 times the quartile distance.The black dotted line denotes the median score of cell clustering without feature selection, and the gray dotted lines denote the first and third quartiles.Supplementary Figure Evaluation using the remaining five metrics with various number of selected features on datasets in HGCA atlas.
other' regions of Cofea and 2627 randomly selected peaks of baseline methods.B, Metrics (NMI, ARI, Homo, and AMI) for Louvain clustering using 2627 peaks from the 'other' regions of Cofea and 2627 peaks randomly selected out of 5000 peaks from baseline methods, comparing to the metrics for clustering using full set of 5000 peaks selected by Cofea and baseline methods.
data and scCAS data.Perform feature selection using (A) HVG, (B) M3Drop, (C) NBDrop on the Peripheral blood monoculear cell (PBMC) and MCA Kidney datasets.PBMC dataset is a scRNA-seq dataset and arrayed on the left, while MCA Kidney is a scCAS dataset and arrayed on the right.D, Feature selection on the PBMC dataset using GiniClust.Due to the binarized nature of the scCAS data, GiniClust throws an error during the computation process, so the corresponding results are not provided here.The points in the figure are representations of features under specific two-dimensional characteristics, and the lines are obtained according to the fit of all genes.The highlighted points (pink points in A and orange points in B and C, respectively) are the informative features that have been selected, while the black and grey points serve as noninformative features.